files, we can use “-V” for each one. Instead of repeating the “-V” option for every gVCF
file, the sample information can be saved in a text file called a cohort sample map file,
which can then be passed with the “--sample-name-map” option. The cohort sample map file is
a plain text file with two tab-separated columns: the first column holds the sample
IDs and the second column holds the names of the gVCF files. Each sample ID is mapped
to a sample file name, as shown in Figure 4.6.
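Because the two columns must be separated by a real tab character, it can be worth checking a map file before passing it to GATK. The following is a minimal sketch, not part of the original workflow: the sample IDs and paths (sampleA, sampleB, /data/gvcf/...) are hypothetical placeholders, and the check only verifies that every line has exactly two non-empty tab-separated fields.

```bash
# Write a hypothetical two-sample map; printf guarantees a real tab separator.
map=$(mktemp)
printf '%s\t%s\n' sampleA /data/gvcf/sampleA.g.vcf.gz >  "$map"
printf '%s\t%s\n' sampleB /data/gvcf/sampleB.g.vcf.gz >> "$map"

# Count lines that do not have exactly two non-empty tab-separated fields.
bad=$(awk -F'\t' 'NF != 2 || $1 == "" || $2 == "" { n++ } END { print n + 0 }' "$map")
echo "malformed lines: $bad"
```

A count of zero means every line has the expected two-column layout. The check does not confirm that the listed gVCF files actually exist.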
The cohort sample map file can be created manually, but we can also use a bash script
to create it. The following script creates a cohort sample map file for our 13
samples; the resulting file is shown in Figure 4.6:
cd gvcf
#a- write "file name<space>absolute path" for each gVCF (-printf requires GNU find)
find "$PWD"/*_chr21.dedup.RG.bqsr.g.vcf.gz -type f -printf '%f %h/%f\n' > ../tmp.txt
#b- replace the file name suffix in column 1 with a comma, leaving the sample ID
awk '{ gsub(/_chr21\.dedup\.RG\.bqsr\.g\.vcf\.gz/, ",", $1); print }' ../tmp.txt > ../tmp2.txt
rm ../tmp.txt
#remove whitespace (assumes the paths contain no spaces)
sed -r 's/\s+//g' ../tmp2.txt > ../tmp3.txt
rm ../tmp2.txt
#replace comma with tab
sed -e 's/,\+/\t/g' ../tmp3.txt > ../cohort.sample_map
rm ../tmp3.txt
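The same two-column file can also be produced without temporary files. The sketch below is an alternative to the script above, not the method used in this chapter: it loops over the gVCF files, strips the shared suffix with basename to get the sample ID, and writes the ID and path separated by a tab. The temporary directory and the sample names S1 and S2 are hypothetical and exist only to make the example self-contained.

```bash
# Demo setup: a temporary directory with two empty, hypothetical gVCF files.
workdir=$(mktemp -d)
mkdir "$workdir/gvcf"
touch "$workdir/gvcf/S1_chr21.dedup.RG.bqsr.g.vcf.gz" \
      "$workdir/gvcf/S2_chr21.dedup.RG.bqsr.g.vcf.gz"

# Suffix shared by all gVCF files; removing it leaves the sample ID.
suffix=_chr21.dedup.RG.bqsr.g.vcf.gz

# One line per file: sample ID, a tab, then the absolute path.
for f in "$workdir"/gvcf/*"$suffix"; do
    name=$(basename "$f" "$suffix")
    printf '%s\t%s\n' "$name" "$f"
done > "$workdir/cohort.sample_map"

cat "$workdir/cohort.sample_map"
```

Because printf writes the tab directly, this version needs no comma placeholder and still works when the paths contain spaces.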
Once we have created the cohort sample map file, we can run the GenomicsDBImport tool
to import the gVCF sample files and the GenotypeGVCFs tool to consolidate the variants of
the 13 samples into a single VCF file.
#create a database
ref=$(ls ../refgenome/*.fasta)
~/software/gatk-4.2.3.0/gatk \
--java-options -Xmx10g \
GenomicsDBImport \
FIGURE 4.6 Cohort sample map file.